NSF PAR Search | NSF Public Access Repository

NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs

Carver, Benjamin; Zhang, Jingyuan; Wang, Haoliang; Mahadik, Kanak; Cheng, Yue (March 2026, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high inter-arrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
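The idea of allocating GPUs only during active cell execution can be illustrated with a toy sketch. This is not NotebookOS's implementation: the `GpuPool` class and its `cell_execution` method are hypothetical names, and Raft replication, migration, and cluster scaling are left out. It only shows that an idle session holds no GPU and that a GPU is bound for the duration of one cell.

```python
from contextlib import contextmanager


class GpuPool:
    """Toy pool (hypothetical): a GPU is bound only while a cell is running."""

    def __init__(self, num_gpus: int) -> None:
        self.num_gpus = num_gpus
        self.free = list(range(num_gpus))  # IDs of GPUs not bound to any cell

    def in_use(self) -> int:
        return self.num_gpus - len(self.free)

    @contextmanager
    def cell_execution(self):
        if not self.free:
            # A real platform might migrate a replica or scale out here;
            # this sketch simply fails.
            raise RuntimeError("no GPU available")
        gpu = self.free.pop()
        try:
            yield gpu
        finally:
            # Released as soon as the cell finishes, even if it raised.
            self.free.append(gpu)


pool = GpuPool(num_gpus=2)
with pool.cell_execution() as gpu:
    in_use_during = pool.in_use()  # one GPU bound while the cell runs
in_use_after = pool.in_use()       # nothing held once the session is idle
```

Because sessions hold nothing between cells, more sessions than GPUs can share a server as long as their cell executions rarely overlap, which is the oversubscription the abstract describes.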
Free, publicly-accessible full text available March 22, 2027
RECON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories

https://doi.org/10.1007/978-3-031-73202-7_17

Lu, Chen-Yi; Agarwal, Shubham; Tanjim, Md Mehrab; Mahadik, Kanak; Rao, Anup; Mitra, Subrata; Saini, Shiv Kumar; Bagchi, Saurabh; Chaterji, Somali (November 2024, Springer Nature Switzerland)

Full Text Available
SandPiper: A Cost-Efficient Adaptive Framework for Online Recommender Systems

https://doi.org/10.1109/BigData55660.2022.10020465

Thinakaran, Prashanth; Mahadik, Kanak; Gunasekaran, Jashwant; Taylan Kandemir, Mahmut; Das, Chita R. (December 2022, 2022 IEEE International Conference on Big Data (Big Data))

Online recommender systems have proven to have ubiquitous applications in various domains. To provide accurate recommendations in real time, it is imperative to constantly train and deploy models with the latest data samples. This retraining involves adjusting the model weights by incorporating newly arrived streaming data into the model to bridge the accuracy gap. The compute for retraining is typically hosted on VMs; however, because data arrival patterns are dynamic, stateless functions would be an ideal alternative to VMs, as they can scale instantaneously on demand. It is non-trivial, however, to statically configure the stateless functions, because model retraining exhibits varying resource needs during its different phases. Therefore, it is crucial to dynamically configure the functions to meet the resource requirements while bridging the accuracy gap. In this paper, we propose Sandpiper, an adaptive framework that leverages stateless functions to deliver accurate predictions at low cost for online recommender systems. The three main ideas in Sandpiper are: (i) a data-drift monitor that automatically triggers model retraining at required time intervals to bridge the accuracy gap due to incoming data drifts; (ii) an online configuration model that selects the appropriate function configurations while maintaining the model serving accuracy within the latency and cost budget; and (iii) a dynamic synchronization policy for stateless functions that speeds up distributed model retraining, minimizing cloud cost. A prototype implementation on AWS shows that Sandpiper maintains average accuracy above 90% while being 3.8× less expensive than traditional VM-based schemes.
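As a quick reading of the cost figure, assuming "3.8× less expensive" means the VM-based cost divided by the Sandpiper cost is 3.8 (the abstract does not spell this out), Sandpiper would cost about 26% of the VM baseline, a saving of about 74%. The helper name below is hypothetical.

```python
def relative_cost(times_less_expensive: float) -> float:
    """Fraction of the baseline cost left when a scheme is N times less expensive."""
    if times_less_expensive <= 0:
        raise ValueError("times_less_expensive must be positive")
    return 1.0 / times_less_expensive


# Assumption: 3.8 is the ratio of VM-based cost to Sandpiper cost.
fraction_of_vm_cost = relative_cost(3.8)
saving = 1.0 - fraction_of_vm_cost
print(f"{fraction_of_vm_cost:.1%} of VM cost, {saving:.1%} saved")
```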
Full Text Available
AutoForecast: Automatic Time-Series Forecasting Model Selection

https://doi.org/10.1145/3511808.3557241

Abdallah, Mustafa; Rossi, Ryan; Mahadik, Kanak; Kim, Sungchul; Zhao, Handong; Bagchi, Saurabh (October 2022, 31st ACM International Conference on Information & Knowledge Management)

Full Text Available